Approximated \(\mathbb{E}_F[g(X)]\) by \[\frac{1}{B}\sum_{b=1}^B g(X_b), \qquad X_1,\ldots,X_B \stackrel{i.i.d.}{\sim} F.\]
Justified by the WLLN when \(\mathbb{E}_F[|g(X)|] < \infty\);
MC standard error of order \(1/\sqrt{B}\).
Today
Use MC simulation to:
Evaluate finite-sample properties of estimators and inference procedures;
Illustrate and visualize theoretical results for estimators and inference procedures.
Why? Asymptotic theory tells us what happens as \(n \to \infty\). MC simulation tells us how well those approximations work at a given sample size, and makes abstract results concrete.
Today (cont’d)
Limitation: MC results are specific to the chosen \(F\) and \(n\). Unlike analytical asymptotic results, which characterize behavior across a class of DGPs, each MC experiment provides evidence for one particular design. To build broader understanding, we explore a range of DGPs and sample sizes.
Theory Guides MC Design
MC simulation is most informative when guided by theory.
Theory identifies where problems should arise — MC quantifies how bad they are at specific \((n, F)\).
Theory identifies which parameters matter — this guides us on what to vary in the simulation.
Example: The delta method for \(g(\hat{\theta})\) relies on \(g\) being approx. linear over the region where \(\hat{\theta}\) lies with high probability. For \(g(x) = 1/x\), the curvature \(g''(x) = 2/x^3\) is large near zero — so theory predicts the approx. will be poor when \(\theta\) is small relative to the SE of \(\hat{\theta}\). This tells us which DGPs to examine.
Theory Guides MC Design (cont’d)
A good MC study follows a pattern:
1. Theory identifies the key quantities that govern performance (e.g., \(\mu/(\sigma/\sqrt{n})\) for the delta method applied to \(1/\bar{X}\)).
2. MC at one DGP confirms the theory and reveals the magnitude of the problem (for a given DGP).
3. MC across DGPs varies the key quantities to map out where asymptotic approximations perform better/worse.
We focus on steps 1–2 for each example, but we note what a fuller MC study would explore.
Setup and Notation
Suppose we have \(n\) i.i.d. draws from a distribution \(F\): \[W_1, \ldots, W_n \stackrel{i.i.d.}{\sim} F.\]
Here \(W_i\) is the full data vector for observation \(i\) — e.g., \(W_i = (Y_i, X_i)\) in a regression setting.
Let \(T_n = T(W_1,\ldots,W_n)\) be a statistic (an estimator, a test statistic, etc.).
Setup and Notation (cont’d)
The sampling distribution of \(T_n\) is determined by \(n\) and \(F\). Denote its CDF by: \[G_n(u, F) \;=\; \mathbb{P}_F\!\left[\,T_n \le u\,\right].\]
\(G_n(\cdot, F)\) depends on the DGP \(F\) and the sample size \(n\).
Asymptotic theory approximates \(G_n(\cdot, F)\) for large \(n\) (e.g., by the CLT).
MC simulation approximates \(G_n(\cdot, F)\) directly, for any \(n\) and any \(F\) we can simulate from.
The MC Recipe
To approximate \(G_n(\cdot, F)\) by simulation:
For \(b = 1, \ldots, B\):
Draw a sample \(W_1^{(b)}, \ldots, W_n^{(b)} \stackrel{i.i.d.}{\sim} F\).
Compute the statistic \(T_n^{(b)} = T(W_1^{(b)}, \ldots, W_n^{(b)})\).
Approximate \(G_n(\cdot, F)\) by the empirical distribution of \(T_n^{(1)}, \ldots, T_n^{(B)}\), \[\hat{G}_{n,B}(u, F) \;=\; \frac{1}{B}\sum_{b=1}^B \mathbf{1}\!\left[T_n^{(b)} \le u\right].\]
The MC Recipe (cont’d)
Any feature of \(G_n(\cdot, F)\) that can be expressed as \(\mathbb{E}_{G_n}[h(T_n)]\) can be approximated by \(\frac{1}{B}\sum_{b=1}^B h(T_n^{(b)})\).
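The recipe takes only a few lines of R. A minimal sketch, with an illustrative DGP and statistic of my choosing (here \(F = \text{Exp}(1)\) and \(T_n\) = sample median; both are stand-ins, not from the slides):

```r
# MC recipe sketch: approximate G_n(., F) for a chosen statistic T.
# Illustrative assumptions: F = Exp(1), T = sample median, n = 30.
set.seed(1)
B <- 10000; n <- 30
T_draws <- replicate(B, {
  W <- rexp(n, rate = 1)  # step 1: draw W_1, ..., W_n iid from F
  median(W)               # step 2: compute T_n^(b)
})
G_hat <- ecdf(T_draws)    # empirical CDF approximates G_n(., F)
mc_mean <- mean(T_draws)  # any feature E[h(T_n)] via an MC average
```

Swapping in a different \(F\) or \(T\) changes only the two lines inside `replicate`.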
The MC Recipe: Key Distinction
The researcher chooses \(F\) and \(n\) — these define the DGP.
The researcher chooses \(B\) — this controls MC accuracy.
\(n\) and \(B\) play very different roles:
\(n\) = sample size in each simulated dataset. Changing \(n\) changes the object \(G_n(\cdot, F)\) being studied.
\(B\) = number of MC replications. Changing \(B\) changes the precision of our approximation of \(G_n(\cdot, F)\).
Evaluating Estimators
Suppose \(T_n = \hat{\theta}_n\) is an estimator of a parameter \(\theta(F)\).
From the MC draws \(\hat{\theta}_n^{(1)}, \ldots, \hat{\theta}_n^{(B)}\), approximate the bias, variance, and MSE of \(\hat{\theta}_n\) by their empirical analogs, e.g., \(\widehat{\text{Bias}} = \frac{1}{B}\sum_{b=1}^B \hat{\theta}_n^{(b)} - \theta(F)\).
Example 1: Estimating \(p\)
Let \(W_1, \ldots, W_n \stackrel{i.i.d.}{\sim} \text{Bernoulli}(p)\). Compare the MLE \(\hat{p} = \frac{1}{n}\sum_{i=1}^n W_i\) with the Laplace estimator \[\tilde{p} = \frac{\sum_{i=1}^n W_i + 1}{n + 2}.\]
\(\tilde{p}\) shrinks toward \(1/2\) — adds one “pseudo-success” and one “pseudo-failure.”
Shrinkage is of order \(1/n\).
Equivalently, \(\tilde{p}\) is the posterior mean under a \(\text{Beta}(1,1)\) (i.e., uniform) prior.
We will return to shrinkage estimators later in the course (ridge regression, lasso).
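The posterior-mean identity is immediate to check numerically (a quick sketch; `n` and `W` are arbitrary illustrative values):

```r
# Beta(1,1) prior + W successes in n trials => posterior Beta(W+1, n-W+1),
# whose mean (W+1)/(n+2) is exactly the Laplace estimator.
n <- 50; W <- 3
a <- W + 1; b <- n - W + 1
post_mean <- a / (a + b)          # Beta(a, b) mean = a/(a+b)
p_tilde   <- (W + 1) / (n + 2)    # Laplace estimator
all.equal(post_mean, p_tilde)
```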
Analytical Properties
Since \(\sum W_i \sim \text{Binomial}(n, p)\):
| | \(\hat{p}\) (MLE) | \(\tilde{p}\) (Laplace) |
|---|---|---|
| Bias | \(0\) | \(\frac{1 - 2p}{n+2}\) |
| Variance | \(\frac{p(1-p)}{n}\) | \(\frac{n \, p(1-p)}{(n+2)^2}\) |
\(\tilde{p}\) is biased toward \(1/2\): the bias is positive when \(p < 1/2\), negative when \(p > 1/2\), and zero at \(p = 1/2\).
\(\tilde{p}\) has strictly smaller variance than \(\hat{p}\) since \(\frac{n}{(n+2)^2} < \frac{1}{n}\).
MSE comparison depends on \(p\) and \(n\).
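Since the exact bias and variance are available, the MSE comparison can be traced out across \(p\) without any simulation (a sketch using the formulas above):

```r
# Exact MSE of the MLE and the Laplace estimator as functions of p.
n <- 50
p <- seq(0.01, 0.99, by = 0.01)
mse_hat    <- p * (1 - p) / n                 # unbiased: MSE = variance
bias_tilde <- (1 - 2 * p) / (n + 2)
var_tilde  <- n * p * (1 - p) / (n + 2)^2
mse_tilde  <- var_tilde + bias_tilde^2        # MSE = variance + bias^2
# Laplace wins near p = 1/2; the MLE wins near the boundaries:
laplace_wins <- mse_tilde < mse_hat
```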
What Theory Tells Us
The bias of \(\tilde{p}\) is \((1 - 2p)/(n+2)\) — largest when \(p\) is far from \(1/2\), and shrinks with \(n\).
The variance reduction factor is \(\frac{n/(n+2)^2}{1/n} = \frac{n^2}{(n+2)^2}\) — nearly 1 for large \(n\).
Theory guides the DGP choice: we pick \(n = 50\) and \(p = 0.05\) (far from \(1/2\), moderate \(n\)) to make the bias-variance tradeoff visible.
Estimating \(p\): Bias and MSE
```r
set.seed(5551)
B <- 100000; n <- 50; p <- 0.05

sim_p <- function(n, p) {
  W <- rbinom(1, n, p)  # sum of n Bernoulli draws
  p_hat <- W / n
  p_tilde <- (W + 1) / (n + 2)
  c(p_hat = p_hat, p_tilde = p_tilde)
}

results <- replicate(B, sim_p(n, p))
p_hat_draws <- results["p_hat", ]
p_tilde_draws <- results["p_tilde", ]
```
\(\hat{p}\) is unbiased but has substantial mass at zero. \(\tilde{p}\) is biased toward \(1/2\) but has lower variance and no mass at zero.
MC vs. Exact: Validation
| Quantity | Estimator | Exact | MC |
|---|---|---|---|
| Bias | MLE | 0.00000 | -0.00001 |
| Variance | MLE | 0.00095 | 0.00095 |
| MSE | MLE | 0.00095 | 0.00095 |
| Bias | Laplace | 0.01731 | 0.01730 |
| Variance | Laplace | 0.00088 | 0.00088 |
| MSE | Laplace | 0.00118 | 0.00118 |
MC estimates closely match the exact values — the small discrepancies are MC noise of order \(1/\sqrt{B}\).
Good practice: when exact results are available, use them to validate your simulation code before moving to settings where MC is the only option.
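In the spirit of reporting MC estimates with their own standard errors, a self-contained sketch for the bias of \(\hat{p}\) (vectorized, since \(\sum_i W_i\) is binomial):

```r
# MC estimate of the bias of p_hat, with its MC standard error.
set.seed(5551)
B <- 10000; n <- 50; p <- 0.05
W <- rbinom(B, n, p)                # B binomial draws at once
p_hat <- W / n
bias_mc    <- mean(p_hat) - p       # MC estimate of the bias (exactly 0 in truth)
bias_mc_se <- sd(p_hat) / sqrt(B)   # MC standard error of that estimate
```

`bias_mc` should be within a few multiples of `bias_mc_se` of the exact value 0; if it is not, suspect a bug in the simulation code.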
Bias–Variance Tradeoff Across \(p\)
Sample mean (MLE) has lower MSE when \(p\) is near 0 or 1, where the bias of \(\tilde{p}\) is largest; Laplace has lower MSE near \(p = 1/2\).
Example 1: Fuller Analysis
A more complete MC study would additionally vary \(n\):
For large \(n\), the shrinkage of order \(1/n\) becomes negligible — the two estimators converge.
Laplace differs more from MLE when \(p\) and \(n\) are small.
Theory tells us the key quantity is \(np\) (expected number of successes): when \(np\) is small, the discrete nature of \(\hat{p}\) dominates; when \(np\) is large, the normal approximation becomes more reliable and bias-variance tradeoffs are modest.
Example 2: Estimating \(1/p\)
Now consider \(\theta(p) = 1/p\). This arises, e.g., when estimating the expected number of trials until the first success.
Exact expressions for bias, variance, and the distribution of \(1/\hat{p}\) and \(1/\tilde{p}\) can be written as finite sums over the binomial distribution, but they do not simplify to informative closed-form formulas.
But the delta method relies on \(g(x) = 1/x\) being approx. linear over the region where \(\hat{p}\) lies with high probability.
Since \(g''(x) = 2/x^3\), the nonlinearity is severe near \(x = 0\). The approximation is poor when \(p\) is small and \(\hat{p}\) has substantial mass near zero.
Delta Method: Visualization
The delta method approximates \(g(x) = 1/x\) by its tangent line at \(\theta = 1/p\). This works when \(\hat{p}\) concentrates in a region where the tangent line is close to \(g\).
Estimating \(1/p\): Simulation
```r
set.seed(5551)
B <- 10000; n <- 50; p <- 0.05
theta <- 1/p  # 1/p = 20

sim_inv_p <- function(n, p) {
  W <- rbinom(1, n, p)
  p_hat <- W / n
  p_tilde <- (W + 1) / (n + 2)
  # 1/p_hat is Inf when W = 0; exclude those
  inv_hat <- ifelse(W > 0, 1 / p_hat, NA)
  inv_tilde <- 1 / p_tilde
  c(inv_hat = inv_hat, inv_tilde = inv_tilde, W = W)
}

results <- replicate(B, sim_inv_p(n, p))
inv_hat_draws <- results["inv_hat", ]
inv_tilde_draws <- results["inv_tilde", ]
W_draws <- results["W", ]
```
Estimating \(1/p\): Simulation (cont’d)

```r
# Fraction of replications where p_hat = 0
frac_zero <- mean(W_draws == 0)
frac_zero
```
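The fraction of all-zero samples can be validated against its exact value, since \(\hat{p} = 0\) occurs with probability \((1-p)^n\) (a sketch):

```r
# Exact probability that p_hat = 0, i.e., all n Bernoulli(p) draws are zero.
n <- 50; p <- 0.05
p_zero_exact  <- (1 - p)^n                      # 0.95^50, about 0.077
p_zero_dbinom <- dbinom(0, size = n, prob = p)  # same value via the binomial pmf
```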
Smoothing stabilizes the denominator, reducing MSE.
Delta Method vs. MC Distributions
The delta method approximation for \(1/\hat{p}\): \[\frac{1}{\hat{p}} \;\overset{\cdot}{\sim}\; N\!\left(\frac{1}{p},\;\; \frac{1-p}{n\,p^3}\right).\] The delta-method normal (the red curve on the previous slide):
is centered too low (misses the upward bias of \(1/\hat{p}\))
is too narrow (understates dispersion).
misses the right skew of the sampling distribution.
Variability of \(1/\hat{p}\): Delta Method vs. MC
For \(p = 0.05\), \(n = 50\):

| Method | Std. Dev. |
|---|---|
| Delta method | 12.33 |
| MC (excl. \(\hat{p} = 0\)) | 14.37 |
Delta method understates the variability of \(1/\hat{p}\) because it relies on approximate linearity of \(1/x\) over the region where \(\hat{p}\) lies with high probability.
When \(p\) is small, \(\hat{p}\) lies with high prob. in a region where \(1/x\) is highly nonlinear — delta method approx. is poor at fixed \(n\), though it improves as \(n \to \infty\) (as \(\hat{p}\) concentrates around \(p\)).
Delta Method: Inference
The delta method suggests the 95% CI: \(\;\;1/\hat{p} \;\pm\; z_{\alpha/2}\,\sqrt{(1-\hat{p})\,/\,(n\,\hat{p}^3)}\).
How does coverage depend on \(p\) (with \(n = 50\))?
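Coverage at one design point is just another MC average: an indicator of whether the interval contains \(1/p\). A sketch (excluding the \(\hat{p} = 0\) replications, where the CI is undefined):

```r
# MC coverage of the delta-method 95% CI for 1/p at n = 50, p = 0.05.
set.seed(5551)
B <- 10000; n <- 50; p <- 0.05
z <- qnorm(0.975)
covered <- replicate(B, {
  W <- rbinom(1, n, p)
  if (W == 0) return(NA)                    # CI undefined when p_hat = 0
  p_hat <- W / n
  se <- sqrt((1 - p_hat) / (n * p_hat^3))   # plug-in delta-method SE
  abs(1 / p_hat - 1 / p) <= z * se
})
coverage <- mean(covered, na.rm = TRUE)
```

Repeating this across a grid of \(p\) values traces out the coverage curve the question asks about.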
When an estimator involves a ratio, and the denominator can be small, be cautious about:
The delta method for variance and distributional approximations.
Plug-in estimators that can explode.
Regularization (here, Laplace smoothing) stabilizes the denominator — a theme we will return to with ridge regression and lasso.
Example 2: Fuller Analysis
Theory identifies the key quantity: the delta method approximation for \(g(\hat{p})\) depends on \(g\) being approximately linear over the region where \(\hat{p}\) lies with high probability.
For \(g(x) = 1/x\) with \(g''(x) = 2/x^3\), the curvature is large near zero.
So the delta method should improve as \(p\) moves away from zero (less mass near zero) and as \(n\) increases (the distribution of \(\hat{p}\) concentrates).
A fuller MC study would vary \(n\) and \(p\) systematically.
Example 3: Nonexistent Moments
Recall from last lecture: MC approximation of \(\mathbb{E}[g(X)]\) fails when \(\mathbb{E}[|g(X)|] = \infty\).
Can this arise with familiar estimators?
Consider \(X_i \stackrel{i.i.d.}{\sim} N(\mu, \sigma^2)\) and the estimand \(\theta = 1/\mu\).
Plug-in estimator: \(T_n = 1/\bar{X}\).
\(\bar{X} \sim N(\mu, \sigma^2/n)\) \(\Rightarrow\) the density of \(\bar{X}\) is bounded away from zero in a neighborhood of the origin \(\Rightarrow\) \(\mathbb{E}[|1/\bar{X}|] = \infty\).
Nonexistent Moments: Not Just Division by Zero
The issue is not that \(\bar{X}\) can equal zero (it does so with probability zero for continuous \(\bar{X}\)).
The issue is that \(\bar{X}\) has density bounded away from zero in a neighborhood of \(0\) \(\Rightarrow\) \(\int |1/x| \, \phi(x;\mu,\sigma^2/n) \, dx = \infty\).
This is true even when \(\mu\) is far from zero, but the practical impact depends on \(\mu / (\sigma/\sqrt{n})\): when this ratio is large, the density of \(\bar{X}\) puts so little probability mass in a neighborhood of zero that the nonexistent moment has no practical consequence.
```r
set.seed(5551)
B <- 10000; n <- 20; sigma <- 1

sim_inv_mean <- function(n, mu, sigma) {
  xbar <- mean(rnorm(n, mu, sigma))
  1 / xbar
}

# mu = 2: ratio mu/(sigma/sqrt(n)) = 2/(1/sqrt(20)) ≈ 8.9
inv_mean_far <- replicate(B, sim_inv_mean(n, mu = 2, sigma))
# mu = 0.2: ratio ≈ 0.89
inv_mean_near <- replicate(B, sim_inv_mean(n, mu = 0.2, sigma))
```
\(\mu = 2\): No Diagnostic Red Flags
With \(\mu=2\), the running mean looks stable — \(\mathbb{E}[1/\bar{X}]\) doesn’t exist, but the diagnostic red flags are absent because \(\bar{X}\) so rarely comes close to zero.
\(\mu = 0.2\): Diagnostic Red Flag
With \(\mu=0.2\), the running mean shows erratic jumps — the diagnostics from last lecture flag this.
How stability of \(1/\bar{X}_n\) depends on \(\mu\).
Since \(\bar{X}_n \sim N(\mu, \sigma^2/n)\), we can compute the probability that \(\bar{X}_n\) passes near zero, where \(1/\bar{X}_n\) explodes. With \(n=20\), \(\sigma = 1\):
| | \(\mu\,/\,(\sigma/\sqrt{n})\) | \(\Pr(\vert\bar{X}_n\vert < 0.1)\) |
|---|---|---|
| \(\mu = 2\) | 8.9 | 0.000000 |
| \(\mu = 0.2\) | 0.9 | 0.237504 |
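These probabilities follow directly from the normal CDF (a sketch):

```r
# P(|Xbar_n| < 0.1) when Xbar_n ~ N(mu, sigma^2/n), with n = 20, sigma = 1.
n <- 20; sigma <- 1
prob_near_zero <- function(mu) {
  s <- sigma / sqrt(n)
  pnorm(0.1, mean = mu, sd = s) - pnorm(-0.1, mean = mu, sd = s)
}
prob_near_zero(2)     # essentially zero in double precision
prob_near_zero(0.2)   # about 0.2375
```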
With \(\mu = 2\), one can show \(\Pr(|\bar{X}_n| < 0.1) < 10^{-17}\) via a Mills-ratio bound; floating-point arithmetic cannot distinguish this from zero.
How stability of … (cont’d)
When \(\mu/(\sigma/\sqrt{n})\) is large, \(\bar{X}_n\) approaches zero so rarely that the nonexistent moment has no practical consequence.
When \(\mu/(\sigma/\sqrt{n})\) is small, \(\bar{X}_n\) passes near zero often enough to destabilize \(1/\bar{X}_n\).
Inference Consequences with \(n=20\)
The delta method CI: \(\;\;1/\bar{X}_n \;\pm\; z_{\alpha/2}\,\cdot\, s\,/\,(\sqrt{n}\,\bar{X}_n^2)\).
| | \(\mu\,/\,(\sigma/\sqrt{n})\) | Coverage (nominal 95%) |
|---|---|---|
| \(\mu = 2\) | 8.9 | 0.937 |
| \(\mu = 0.2\) | 0.9 | 0.792 |
When \(\mu/(\sigma/\sqrt{n})\) is large, the delta method CI performs adequately despite the nonexistent moment.
When \(\mu/(\sigma/\sqrt{n})\) is small, coverage degrades — the poor normal approximation to the sampling distribution translates directly into unreliable inference.
\(1/\bar{X}\): Delta Method vs. MC Distribution
The delta method predicts \(\sqrt{n}\,(1/\bar{X} - 1/\mu)\,/\sqrt{\sigma^2/\mu^4} \;\overset{\cdot}{\sim}\; N(0,1).\)
MC quantiles of the standardized statistic vs. the \(N(0,1)\) reference:

| Quantile | \(N(0,1)\) | MC, \(\mu = 2\) | MC, \(\mu = 0.2\) |
|---|---|---|---|
| 2.5% | -1.96 | -1.59 | -9.24 |
| 5% | -1.64 | -1.37 | -4.97 |
| 10% | -1.28 | -1.11 | -2.71 |
| 25% | -0.67 | -0.60 | -0.57 |
| 50% | 0.00 | 0.01 | -0.32 |
| 75% | 0.67 | 0.74 | 0.20 |
| 90% | 1.28 | 1.47 | 1.55 |
| 95% | 1.64 | 2.00 | 3.59 |
| 97.5% | 1.96 | 2.53 | 7.09 |
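The MC quantile columns can be regenerated in a few lines (a sketch mirroring the earlier simulation; exact values vary slightly with the RNG state):

```r
# Standardized delta-method statistic for 1/Xbar: MC quantiles vs N(0,1).
set.seed(5551)
B <- 10000; n <- 20; sigma <- 1
std_draws <- function(mu) {
  replicate(B, {
    xbar <- mean(rnorm(n, mu, sigma))
    sqrt(n) * (1 / xbar - 1 / mu) / sqrt(sigma^2 / mu^4)
  })
}
probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)
rbind(normal = qnorm(probs),
      mu_2   = quantile(std_draws(2), probs),
      mu_0.2 = quantile(std_draws(0.2), probs))
```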
\(1/\bar{X}\): Delta Method vs. MC Distribution (cont’d)
\(\mu = 2\) (\(\mu/\text{SE} = 8.9\)): MC quantiles closely match \(N(0,1)\) — the delta method approximation is accurate.
\(\mu = 0.2\) (\(\mu/\text{SE} = 0.89\)): the distribution is heavily left-skewed with extreme left-tail quantiles — the normal approximation fails qualitatively, not just in scale.
Takeaway: Nonexistent Moments
\(\mathbb{E}[1/\bar{X}]\) does not exist for any \(\mu\) when \(X_i \sim N(\mu,\sigma^2)\).
When \(\mu/(\sigma/\sqrt{n})\) is large, the nonexistence is a mathematical fact with no practical consequences for MC.
When \(\mu/(\sigma/\sqrt{n})\) is small, the nonexistence manifests in unstable MC estimates.
Example 3 (cont’d)
Theory identifies \(\mu/(\sigma/\sqrt{n})\) as the key quantity governing the practical severity of the nonexistent moment.
For a more complete analysis, would examine how coverage of the delta method CI degrades as \(\mu/(\sigma/\sqrt{n}) \to 0\), generating a coverage curve indexed by this ratio.
Example 3 (cont’d)
Theory identifies \(\mu/(\sigma/\sqrt{n})\) as the key quantity governing the practical severity of the nonexistent moment.
The same structure arises whenever an estimator involves a denominator that can pass near zero. When the denominator is approximately normally distributed and centered at its true value, the ratio of the true value to its SE is the natural index — it measures how many standard deviations separate the denominator from zero.
Connection to IV Estimation
The same structure arises with IV estimation: the exactly-identified IV estimator has no finite moments regardless of instrument strength, though instrument strength affects how practically severe the tail problem is at a given \(n\).
Separately, when instruments are weak, the asymptotic normal approximation to the sampling distribution of the IV estimator is poor — a failure of \(G_n(\cdot, F) \approx N(\cdot)\), not just a moment problem.
Example 4: Inference with Unreliable Standard Errors
Consider the linear regression \(Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\) with heteroskedastic errors: \(\text{Var}(\varepsilon_i \mid X_i) = \sigma^2 X_i^2.\)
OLS \(\hat{\beta}_1\) is still consistent and asymptotically normal, but inference depends on the standard error:
Homoskedastic SE — valid only when \(\text{Var}(\varepsilon_i \mid X_i) = \sigma^2\); inconsistent here.
HC0 (White) robust SE — consistent under heteroskedasticity; HC1 adds a degrees-of-freedom correction.
HC3 SE — finite-sample correction.
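The three sandwich variants can be computed by hand in base R (a sketch; in practice one would typically use the sandwich package's `vcovHC`, and the DGP values here are illustrative):

```r
# HC0/HC1/HC3 sandwich SEs for the OLS slope, in base R.
set.seed(5551)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = abs(x))    # Var(eps | x) = x^2: heteroskedastic
fit <- lm(y ~ x)
X <- model.matrix(fit)
e <- residuals(fit)
h <- hatvalues(fit)                       # leverages h_i
k <- ncol(X)
XtXinv <- solve(crossprod(X))
meat <- function(w) crossprod(X * w)      # sum_i w_i^2 x_i x_i'
vc_hc0 <- XtXinv %*% meat(e) %*% XtXinv           # White (HC0)
vc_hc1 <- vc_hc0 * n / (n - k)                    # d.o.f. correction
vc_hc3 <- XtXinv %*% meat(e / (1 - h)) %*% XtXinv # leverage correction
se <- sqrt(c(HC0 = vc_hc0[2, 2], HC1 = vc_hc1[2, 2], HC3 = vc_hc3[2, 2]))
```

By construction HC1 and HC3 are never smaller than HC0, which is how they mitigate its downward finite-sample bias.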
Example 4: Theory Guides the DGP
With \(\text{Var}(\varepsilon_i \mid X_i) = \sigma^2 X_i^2\), the degree of heteroskedasticity depends on the distribution of \(X_i\).
If \(X_i\) has a wide range, \(X_i^2\) varies a lot \(\Rightarrow\) severe heteroskedasticity.
If \(X_i\) is nearly constant, \(X_i^2 \approx c\) \(\Rightarrow\) nearly homoskedastic, and the choice of SE matters little.
Homoskedastic SE: over-rejection that does not vanish with \(n\) — consistent with theory (the SE converges to the wrong limit).
HC0 (White): consistent, but over-rejects at small \(n\) — the finite-sample bias of the sandwich estimator matters.
HC1: the \(n/(n-k)\) correction helps slightly.
HC3: best finite-sample performance — the leverage-based correction addresses the downward bias in HC0.
Visualizing the Problem: \(t\)-Statistic Distributions
With the homoskedastic SE, the \(t\)-statistic is too dispersed — the SE underestimates the true variability of \(\hat{\beta}_1\). The \(N(0,1)\) reference distribution is too narrow.
Takeaway: Standard Errors Matter
The estimator \(\hat{\beta}_1\) can be perfectly fine (consistent, asymptotically normal), yet inference breaks down if the SE is unreliable.
An inconsistent SE (wrong formula) causes size distortions that do not vanish with \(n\).
A consistent but imprecise SE (HC1 at small \(n\)) causes size distortions that vanish as \(n \to \infty\), but may be substantial at a given \(n\).
HC3 provides a finite-sample correction.
Example 4: Fuller Analysis
Theory tells us the severity of the problem depends on:
Degree of heteroskedasticity: how much \(\text{Var}(\varepsilon_i \mid X_i)\) varies across \(i\). In our DGP, this is controlled by the distribution of \(X_i\).
Leverage: observations with high leverage (extreme \(X_i\) values) have disproportionate influence on the sandwich estimator’s finite-sample bias.
Sample size: HC0 and HC1 are consistent, so their over-rejection vanishes — but the question is how large the distortions are at a given \(n\).
Example 4: Fuller Analysis (cont’d)
A fuller MC study would vary:
the heteroskedasticity function
the \(X\) distribution
e.g., try different distributions with the same variance but different tail behavior to see how leverage affects the bias of the sandwich estimator
Example 5: Post-Selection Inference
Suppose we estimate \(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i\) but are unsure whether \(X_2\) belongs in the model. A common workflow:
Fit the full model; test \(H_0: \beta_2 = 0\) at level \(\alpha_{\text{sel}} = 0.05\).
If reject: keep the full model. If fail to reject: drop \(X_2\).
Report \(\hat{\beta}_1\) from the selected model.
This is a natural workflow — researchers routinely test whether to include covariates.
But the resulting \(\hat{\beta}_1^{\text{post}}\) is a random mixture of two different estimators, and its distribution is generally not normal.
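The workflow is easy to simulate. A sketch, with illustrative DGP values of my choosing (\(\rho = 0.7\), \(\beta_2 = 0.2\), picked to sit in the "intermediate" range discussed below):

```r
# Post-selection estimator of beta1: pre-test beta2, drop X2 if insignificant.
set.seed(5551)
B <- 2000; n <- 100
beta1 <- 1; beta2 <- 0.2; rho <- 0.7
post_draws <- replicate(B, {
  x1 <- rnorm(n)
  x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)  # Corr(x1, x2) = rho
  y  <- beta1 * x1 + beta2 * x2 + rnorm(n)
  full <- lm(y ~ x1 + x2)
  t2 <- coef(summary(full))["x2", "t value"]
  if (abs(t2) > qt(0.975, df = n - 3)) {
    coef(full)["x1"]               # reject: keep the full model
  } else {
    coef(lm(y ~ x1))["x1"]         # fail to reject: drop x2 (short model)
  }
})
mean(post_draws) - beta1           # bias of the post-selection estimator
```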
Example 5: Theoretical Background
Under the normal-error DGP with fixed \(X\): \[(\hat\beta_1, \hat\beta_2) \sim N\!\left(\beta,\; \sigma^2 (X'X)^{-1}\right)\]
In this two-regressor setting, \(\text{Corr}(X_1, X_2) > 0\) \(\Rightarrow\) the off-diagonals of \((X'X)^{-1}\) are negative \(\Rightarrow\) \(\text{Cov}(\hat\beta_1, \hat\beta_2) < 0\).
Selection event: \(\{|t_2| \leq c\}\) truncates the range of \(\hat\beta_2\). Since \(\text{Corr}(\hat\beta_1, \hat\beta_2)<0\), this truncation shifts the conditional distribution of \(\hat\beta_1\).
OVB when \(X_2\) is dropped: the short regression estimator satisfies \(\tilde\beta_1 \xrightarrow{p} \beta_1 + \gamma \beta_2\) where \(\gamma = \text{Cov}(X_1,X_2)/\text{Var}(X_1)\).
Example 5: What Theory Predicts
The post-selection estimator \(\hat\beta_1^{\text{post}}\) is a mixture of the full-model and short-model estimators:
\(\beta_2 = 0\): the post-selection estimator is fine — both the long and short regressions are unbiased for \(\beta_1\), and the symmetry of the selection event around \(\hat{\beta}_2 = 0\) preserves unbiasedness of the mixture.
\(\beta_2\) large: the test rejects with high probability, so the full model is almost always kept.
\(\beta_2\) intermediate: worst case — the test frequently fails to reject, selecting the short model (which has OVB), and conditioning on the selection event biases both components. Here the post-selection estimator is biased — from both OVB in the short model and conditioning on the selection event — and has poor coverage.
Post-Selection: Sampling Distribution
The post-selection distribution is a mixture: shifted left when \(X_2\) is dropped (OVB), centered when kept. This is the theory made visible.
The always-full estimator has a clean normal distribution.
Post-Selection: Varying \(\beta_2\)
Coverage of the post-selection CI is worst for intermediate \(\beta_2\) — exactly as theory predicted: large enough to cause OVB when dropped, but small enough that the test often fails to detect it.
Post-Selection: Why Does This Happen?
Validity of the standard CI rests on two premises:
The model was known in advance (not selected from the data).
The estimator has a single normal sampling distribution.
Post-Selection: Why Does This Happen? (cont’d)
But the post-selection estimator:
Is a mixture of two estimators (full and restricted), with mixture weights that depend on the data.
Has a sampling distribution that is generally not normal. Under local alternatives \(\beta_2 = c/\sqrt{n}\), the mixture persists asymptotically.
The standard SE does not account for the model selection step.
Example 5: Fuller Analysis
Theory identifies several quantities that govern the severity of the problem:
Correlation \(\rho\): higher \(|\rho|\) \(\Rightarrow\) larger OVB \(= \gamma \beta_2\) when \(X_2\) is dropped, and stronger dependence between \(\hat\beta_1\) and the selection event.
Signal-to-noise ratio \(\beta_2/\sigma\): determines the power of the selection test, and hence the mixing proportions.
Sample size \(n\): more power to detect \(\beta_2 \ne 0\), but the problem doesn’t vanish — the dip shifts toward smaller \(\beta_2\) as \(n\) grows.
A fuller MC study would vary \(\rho\), \(n\), and the number of candidate regressors (the problem compounds with more selection steps).
Summary of Examples
| Example | Key Lesson |
|---|---|
| \(\hat{p}\) vs. Laplace | Bias–variance tradeoff; shrinkage can reduce MSE |
| \(1/\hat{p}\) vs. \(1/\tilde{p}\) | Small denominators degrade the delta method approximation; regularization helps |
| \(1/\bar{X}\) | Nonexistent moments; diagnostics from Part 1 |
Summary of Examples (cont’d)
| Example | Key Lesson |
|---|---|
| Heteroskedastic SEs | Wrong SE \(\Rightarrow\) invalid inference, even with a good estimator |
| Post-selection inference | Data-dependent model selection invalidates standard CIs |
In every case, theory identified the problem and MC simulation revealed its magnitude at specific \((n, F)\).
Connection to \(G_n(\cdot, F)\)
All the quantities we’ve computed are features of \(G_n(\cdot, F)\):
Bias: \(\int u \, dG_n(u, F) - \theta(F)\).
MSE: \(\int (u - \theta)^2 \, dG_n(u, F)\).
Size: \(G_n(-z_{\alpha/2},\, F_0) + (1 - G_n(z_{\alpha/2},\, F_0))\), for the \(t\)-statistic under \(H_0\).
Power: same expression, evaluated at \(F_1 \ne F_0\).
Coverage: \(G_n\) applied to the CI endpoints.
MC simulation approximates \(G_n(\cdot, F)\) by its empirical analog \(\hat{G}_{n,B}(\cdot, F)\).
The MC Recipe: Summary
Looking Ahead: The Bootstrap
In MC simulation, we know \(F\) and approximate \(G_n(\cdot, F)\) by simulation.
In practice, \(F\) is unknown — we observe only one sample \(Y_1,\ldots,Y_n\).
The bootstrap idea: replace the unknown \(F\) with an estimate \(\hat{F}_n\) (e.g., the empirical distribution), then approximate \(G_n(\cdot, \hat{F}_n)\) by simulation.
The MC recipe is exactly the same — only the input changes from \(F\) to \(\hat{F}_n\).
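Side by side in code, the two procedures differ only in the sampling line. A sketch (the "observed" sample here is itself simulated purely for illustration):

```r
# Bootstrap = the MC recipe with F replaced by F_hat_n (resample the data).
set.seed(5551)
n <- 50
y_obs <- rexp(n)        # stands in for the single observed sample
B <- 2000
T_boot <- replicate(B, {
  y_star <- sample(y_obs, n, replace = TRUE)  # Y*_i ~ F_hat_n
  mean(y_star)                                # T_n^{*(b)}
})
boot_se <- sd(T_boot)   # bootstrap SE of the sample mean
```

Compare `boot_se` with the analytic `sd(y_obs)/sqrt(n)`: they agree up to MC noise.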
Looking Ahead (cont’d)
| | MC Simulation | Bootstrap |
|---|---|---|
| Distribution | \(F\) (known, chosen) | \(\hat{F}_n\) (estimated from data) |
| Draws | \(Y_i^{(b)} \sim F\) | \(Y_i^{*(b)} \sim \hat{F}_n\) |
| Statistic | \(T_n^{(b)} = T(Y^{(b)})\) | \(T_n^{*(b)} = T(Y^{*(b)})\) |
| Approximates | \(G_n(\cdot, F)\) | \(G_n(\cdot, \hat{F}_n)\) |
The bootstrap works because \(\hat{F}_n \to F\), so \(G_n(\cdot, \hat{F}_n) \approx G_n(\cdot, F)\). The computational machinery is identical — only the conceptual justification differs.
Practical Advice
Choosing \(B\): for size/coverage, \(B = 10{,}000\) gives MC standard error \(\le 0.005\). For power curves, \(B = 5{,}000\) per point is often sufficient.
Choosing \(n\): examine a range of \(n\) to see how finite-sample behavior approaches the asymptotic benchmark.
Choosing \(F\): try multiple DGPs — normal errors, skewed errors, heavy tails, heteroskedasticity — to stress-test the method. Let theory guide which features of \(F\) to vary.
Report MC standard errors: treat MC estimates as estimates, not exact quantities.
Summary
MC simulation lets us evaluate estimators and inference procedures at any finite\(n\) and for any\(F\) we can draw from.
The sampling distribution \(G_n(\cdot, F)\) is the central object. MC approximates it via repeated simulation.
Theory and MC are complements: theory identifies the key quantities and predicts where problems arise; MC quantifies how severe they are at specific \((n, F)\).
The same computational framework — simulate from a known distribution and compute statistics — underlies the bootstrap, which we will study next.